KMID : 1100520220280010016
Healthcare Informatics Research 2022, Vol. 28, No. 1, pp. 16-24
Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model
Oh Seo-Hyun, Kang Min, Lee Young-Ho
Abstract
Objectives: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-training models that have recently attracted significant attention and to determine which model is more suitable for PHI recognition.
Methods: We compared the PHI recognition performance of deep learning models using the i2b2 2014 dataset. We used three pre-training models, namely bidirectional encoder representations from transformers (BERT), the robustly optimized BERT pre-training approach (RoBERTa), and XLNet (a model built on Transformer-XL), to detect PHI. After the dataset was tokenized, it was labeled with an inside-outside-beginning (IOB) tagging scheme and WordPiece-tokenized before being fed into these models. The PHI recognition performance of BERT, RoBERTa, and XLNet was then investigated.
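As a minimal sketch of the preprocessing described above (not the authors' code; it assumes the Hugging Face transformers library, and the sentence, model name, and tag names are illustrative):

    # Word-level IOB PHI tags are aligned to WordPiece subwords: every
    # subword inherits its word's tag, while special tokens ([CLS], [SEP])
    # get a placeholder so the loss can ignore them during fine-tuning.
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

    # Hypothetical clinical fragment with word-level IOB tags.
    words = ["Record", "date", ":", "2067-05-03"]
    iob_tags = ["O", "O", "O", "B-DATE"]

    encoding = tokenizer(words, is_split_into_words=True)

    aligned_tags = [
        "IGN" if word_id is None else iob_tags[word_id]
        for word_id in encoding.word_ids()
    ]

    print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
    print(aligned_tags)

In actual training, the "IGN" placeholder would be the ignore index (-100) of the cross-entropy loss, so that special tokens do not contribute to the gradient.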
Results: A comparison of the PHI recognition performance of the three models confirmed that XLNet achieved the highest F1-score, 96.29%. In addition, in the per-entity PHI performance evaluation, RoBERTa and XLNet showed a 30% improvement in performance compared to BERT.
Conclusions: Among the pre-training models used in this study, XLNet exhibited the best performance because its word embeddings were well constructed using the two-stream self-attention method. In addition, RoBERTa and XLNet outperformed BERT, indicating that they were more effective at grasping the context.
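How such fine-tuning is typically set up can be sketched as follows (assuming the Hugging Face transformers API; the label count, learning rate, and dummy batch are illustrative, not the paper's settings):

    # One illustrative training step for XLNet as a token classifier.
    import torch
    from transformers import XLNetForTokenClassification, XLNetTokenizerFast

    num_labels = 3  # illustrative IOB label set, e.g., {"O", "B-DATE", "I-DATE"}
    tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
    model = XLNetForTokenClassification.from_pretrained(
        "xlnet-base-cased", num_labels=num_labels
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    # Dummy batch; real training iterates over the IOB-tagged,
    # subword-aligned i2b2 2014 sentences.
    batch = tokenizer(["Record date : 2067-05-03"], return_tensors="pt")
    labels = torch.zeros_like(batch["input_ids"])  # all-"O" placeholder labels

    outputs = model(**batch, labels=labels)  # cross-entropy loss over tags
    outputs.loss.backward()
    optimizer.step()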
Keywords
Artificial Intelligence, Big Data, Medical Informatics, Data Anonymization, Deep Learning